What are CRM Analytics?¶

CRM analytics are data that demonstrate your company’s sales and customer service performance. CRM analytics also presents customer data that you can use to inform smarter business decisions. Typically, you’ll use CRM software to obtain CRM analytics and automate all your data collection and report generation.

Benefits of CRM Analysis¶

The primary benefit of CRM analysis is that you can use it to inform your sales, customer service and marketing processes. You can use your CRM analytics to improve your methods via:

Customer service evaluations. CRM analytics fill you in on your customer service team’s performance. If you see figures that your team could improve, implement practices that push your team toward these goals.

Accurate customer data. Whether you’re using your customer data for demographic marketing or email marketing, you need to know whether you’re reaching the right person. CRM analysis ensures you’re doing just that.

Thorough customer analytics. How much does your customer usually spend per quarter with you? Are they buying the same products time and time again, or does it vary? With CRM analytics, you’ll get firm answers to these questions, and you can use what you learn to refine your marketing strategies.

Efficient lead generation. Your CRM analysis can tell you which of your marketing efforts most strongly correlate to purchases. If you see one approach correlating strongly to purchases but have only targeted a sliver of your customers with that approach, try that method more – your sales might increase.

Libraries and Utilities¶

In [11]:
pip install black
Requirement already satisfied: black in c:\users\jki\anaconda3\lib\site-packages (0.0)
Requirement already satisfied: click>=8.0.0 in c:\users\jki\anaconda3\lib\site-packages (from black) (8.0.4)
Requirement already satisfied: mypy-extensions>=0.4.3 in c:\users\jki\anaconda3\lib\site-packages (from black) (0.4.3)
Requirement already satisfied: packaging>=22.0 in c:\users\jki\anaconda3\lib\site-packages (from black) (23.0)
Requirement already satisfied: pathspec>=0.9.0 in c:\users\jki\anaconda3\lib\site-packages (from black) (0.10.3)
Requirement already satisfied: platformdirs>=2 in c:\users\jki\anaconda3\lib\site-packages (from black) (2.5.2)
Requirement already satisfied: colorama in c:\users\jki\anaconda3\lib\site-packages (from click>=8.0.0->black) (0.4.6)
Note: you may need to restart the kernel to use updated packages.
In [12]:
import os
import datetime
import squarify
import warnings
import pandas as pd 
import numpy as np
import datetime as dt
from operator import attrgetter
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import plotly.graph_objs as go
from plotly.offline import iplot
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)
from lifetimes import BetaGeoFitter, GammaGammaFitter
from lifetimes.plotting import plot_period_transactions
%matplotlib inline
%load_ext nb_black
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
palette = 'Set2'
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[12], line 20
     18 from lifetimes.plotting import plot_period_transactions
     19 get_ipython().run_line_magic('matplotlib', 'inline')
---> 20 get_ipython().run_line_magic('load_ext', 'nb_black')
     21 warnings.filterwarnings('ignore')
     22 sns.set_style('whitegrid')

File ~\anaconda3\Lib\site-packages\IPython\core\interactiveshell.py:2414, in InteractiveShell.run_line_magic(self, magic_name, line, _stack_depth)
   2412     kwargs['local_ns'] = self.get_local_scope(stack_depth)
   2413 with self.builtin_trap:
-> 2414     result = fn(*args, **kwargs)
   2416 # The code below prevents the output from being displayed
   2417 # when using magics with decodator @output_can_be_silenced
   2418 # when the last Python token in the expression is a ';'.
   2419 if getattr(fn, magic.MAGIC_OUTPUT_CAN_BE_SILENCED, False):

File ~\anaconda3\Lib\site-packages\IPython\core\magics\extension.py:33, in ExtensionMagics.load_ext(self, module_str)
     31 if not module_str:
     32     raise UsageError('Missing module name.')
---> 33 res = self.shell.extension_manager.load_extension(module_str)
     35 if res == 'already loaded':
     36     print("The %s extension is already loaded. To reload it, use:" % module_str)

File ~\anaconda3\Lib\site-packages\IPython\core\extensions.py:76, in ExtensionManager.load_extension(self, module_str)
     69 """Load an IPython extension by its module name.
     70 
     71 Returns the string "already loaded" if the extension is already loaded,
     72 "no load function" if the module doesn't have a load_ipython_extension
     73 function, or None if it succeeded.
     74 """
     75 try:
---> 76     return self._load_extension(module_str)
     77 except ModuleNotFoundError:
     78     if module_str in BUILTINS_EXTS:

File ~\anaconda3\Lib\site-packages\IPython\core\extensions.py:91, in ExtensionManager._load_extension(self, module_str)
     89 with self.shell.builtin_trap:
     90     if module_str not in sys.modules:
---> 91         mod = import_module(module_str)
     92     mod = sys.modules[module_str]
     93     if self._call_load_ipython_extension(mod):

File ~\anaconda3\Lib\importlib\__init__.py:126, in import_module(name, package)
    124             break
    125         level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)

File <frozen importlib._bootstrap>:1204, in _gcd_import(name, package, level)

File <frozen importlib._bootstrap>:1176, in _find_and_load(name, import_)

File <frozen importlib._bootstrap>:1140, in _find_and_load_unlocked(name, import_)

ModuleNotFoundError: No module named 'nb_black'
In [19]:
import pandas as pd

# Try reading the CSV file with different encodings
try:
    crm_data_df = pd.read_csv(r"C:\Users\jki\Downloads\crm data.csv", encoding='utf-8')
except UnicodeDecodeError:
    crm_data_df = pd.read_csv(r"C:\Users\jki\Downloads\crm data.csv", encoding='latin1')

crm_data_df.head(5)
Out[19]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 12/1/2010 8:26 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/2010 8:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/2010 8:26 3.39 17850.0 United Kingdom

Understanding Data¶

Context¶

Typically e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, The UCI Machine Learning Repository has made this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found by the title "Online Retail".

Variable Description¶

InvoiceNo: Invoice number that consists 6 digits. If this code starts with letter 'c', it indicates a cancellation.

StockCode: Product code that consists 5 digits.

Description: Product name.

Quantity: The quantities of each product per transaction.

InvoiceDate: Represents the day and time when each transaction was generated.

UnitPrice: Product price per unit.

CustomerID: Customer number that consists 5 digits. Each customer has a unique customer ID.

Country: Name of the country where each customer resides.

In [20]:
world_map = crm_data_df[['CustomerID', 'InvoiceNo', 'Country']
              ].groupby(['CustomerID', 'InvoiceNo', 'Country']
                       ).count().reset_index(drop = False)
countries = world_map['Country'].value_counts()
data = dict(type='choropleth',
            locations = countries.index,
            locationmode = 'country names',
            z = countries,
            text = countries.index,
            colorbar = {'title':'Orders'},
            colorscale='Viridis',
            reversescale = False)

layout = dict(title={'text': "Number of Orders by Countries",
                     'y':0.9,
                     'x':0.5,
                     'xanchor': 'center',
                     'yanchor': 'top'},
              geo = dict(resolution = 50,
                         showocean = True,
                         oceancolor = "LightBlue",
                         showland = True,
                         landcolor = "whitesmoke",
                         showframe = True),
             template = 'plotly_white',
             height = 600,
             width = 1000)

choromap = go.Figure(data = [data], layout = layout)
iplot(choromap, validate = False)

Descriptive Statistics¶

In [22]:
def desc_stats(dataframe):
    desc_df = pd.DataFrame(index= dataframe.columns, 
                           columns= dataframe.describe().T.columns,
                           data= dataframe.describe().T)
    
    f,ax = plt.subplots(figsize=(10,
                                 desc_df.shape[0] * 0.81))
    sns.heatmap(desc_df,
                annot = True,
                cmap = "Greens",
                fmt = '.2f',
                ax = ax,
                linecolor = 'white',
                linewidths = 1.1,
                cbar = False,
                annot_kws = {"size": 12})
    plt.xticks(size = 18)
    plt.yticks(size = 14,
               rotation = 0)
    plt.title("Descriptive Statistics", size = 14)
    plt.show()
    
desc_stats(crm_data_df.select_dtypes(include = [float, int]))

📌 It is clearly seems that there are ouliters in Quantity and UnitPrice that have to be handled

📌 There are negative values in UnitPrice and Quantity because of cancelled orders.

📌 Missing values in Customer ID and Description.

📌 Quantity and Unit Price should be multiplied in order to create Total Price.

Data Preprocessing¶

In [25]:
def replace_with_thresholds(dataframe, variable, q1 = 0.25, q3 = 0.75):
    
    '''
    Detects outliers with IQR method and replaces with thresholds 
    
    '''
    
    df_ = dataframe.copy()
    quartile1 = df_[variable].quantile(q1)
    quartile3 = df_[variable].quantile(q3)
    iqr = quartile3 - quartile1
    
    up_limit = quartile3 + 1.5 * iqr
    low_limit = quartile1 - 1.5 * iqr
    df_.loc[(df_[variable] < low_limit), variable] = low_limit
    df_.loc[(df_[variable] > up_limit), variable] = up_limit
    
    return df_

def ecommerce_preprocess(dataframe):
    df_ = dataframe.copy()
    
    #Missing Values
    df_ = df_.dropna()
    
    #Cancelled Orders & Quantity
    df_ = df_[~df_['InvoiceNo'].str.contains('C', na = False)]
    df_ = df_[df_['Quantity'] > 0]
    
    #Replacing Outliers
    df_ = replace_with_thresholds(df_, "Quantity", q1 = 0.01, q3 = 0.99)
    df_ = replace_with_thresholds(df_, "UnitPrice", q1 = 0.01, q3 = 0.99)
    
    #Total Price
    df_["TotalPrice"] = df_["Quantity"] * df_["UnitPrice"]
    
    return df_
In [26]:
crm_data_df= ecommerce_preprocess(crm_data_df)

desc_stats(crm_data_df.select_dtypes(include = [float, int]))

RFM Analysis¶

RFM stands for Recency, Frequency, and Monetary value, each corresponding to some key customer trait. These RFM metrics are important indicators of a customer’s behavior because frequency and monetary value affects a customer’s lifetime value, and recency affects retention, a measure of engagement.

RFM factors illustrate these facts:

The more recent the purchase, the more responsive the customer is to promotions The more frequently the customer buys, the more engaged and satisfied they are Monetary value differentiates heavy spenders from low-value purchasers

RFM Metrics¶

In [27]:
print(crm_data_df['InvoiceDate'].max())
9/9/2011 9:52
In [34]:
import pandas as pd
import datetime as dt

# Assuming 'InvoiceDate' is a string column representing dates
crm_data_df['InvoiceDate'] = pd.to_datetime(crm_data_df['InvoiceDate'])

today_date = dt.datetime(2011, 9, 9)

# Drop rows with NaN values
crm_data_df.dropna(subset=['InvoiceDate', 'InvoiceNo', 'TotalPrice'], inplace=True)

rfm = crm_data_df.groupby('CustomerID').agg({'InvoiceDate': lambda x: (today_date - x.max()).days,
                                             'InvoiceNo': lambda x: x.nunique(),
                                             'TotalPrice': lambda x: x.sum()})

rfm.columns = ['recency', 'frequency', 'monetary']

# Filter 'monetary' column based on the condition
rfm['monetary'] = rfm['monetary'][rfm['monetary'] > 0]

rfm = rfm.reset_index()

rfm.head()
Out[34]:
CustomerID recency frequency monetary
0 12346.0 233 1 310.44
1 12347.0 -90 7 4310.00
2 12348.0 -17 4 1770.78
3 12349.0 -74 1 1491.72
4 12350.0 218 1 331.46

RFM Scores¶

In [35]:
def get_rfm_scores(dataframe) -> pd.core.frame.DataFrame:

    df_ = dataframe.copy()
    df_["recency_score"] = pd.qcut(df_["recency"], 5, labels=[5, 4, 3, 2, 1])
    df_["frequency_score"] = pd.qcut(
        df_["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]
    )
    df_["monetary_score"] = pd.qcut(df_["monetary"], 5, labels=[1, 2, 3, 4, 5])
    df_["RFM_SCORE"] = df_["recency_score"].astype(str) + df_["frequency_score"].astype(
        str
    )

    return df_


rfm = get_rfm_scores(rfm)

Segmentation¶

In [36]:
seg_map = {r'[1-2][1-2]': 'hibernating',
           r'[1-2][3-4]': 'at_Risk',
           r'[1-2]5': 'cant_loose',
           r'3[1-2]': 'about_to_sleep',
           r'33': 'need_attention',
           r'[3-4][4-5]': 'loyal_customers',
           r'41': 'promising',
           r'51': 'new_customers',
           r'[4-5][2-3]': 'potential_loyalists',
           r'5[4-5]': 'champions'}

rfm['segment'] = rfm['RFM_SCORE'].replace(seg_map, regex = True)

rfm.head()
Out[36]:
CustomerID recency frequency monetary recency_score frequency_score monetary_score RFM_SCORE segment
0 12346.0 233 1 310.44 1 1 2 11 hibernating
1 12347.0 -90 7 4310.00 5 5 5 55 champions
2 12348.0 -17 4 1770.78 2 4 4 24 at_Risk
3 12349.0 -74 1 1491.72 4 1 4 41 promising
4 12350.0 218 1 331.46 1 1 2 11 hibernating

Segmentation Map¶

In [37]:
segments = rfm["segment"].value_counts().sort_values(ascending=False)
fig = plt.gcf()
ax = fig.add_subplot()
fig.set_size_inches(16, 10)
squarify.plot(
    sizes=segments,
    label=[label for label in seg_map.values()],
    color=[
        "#AFB6B5",
        "#F0819A",
        "#926717",
        "#F0F081",
        "#81D5F0",
        "#C78BE5",
        "#748E80",
        "#FAAF3A",
        "#7B8FE4",
        "#86E8C0",
    ],
    pad=False,
    bar_kwargs={"alpha": 1},
    text_kwargs={"fontsize": 15},
)
plt.title("Customer Segmentation Map", fontsize=20)
plt.xlabel("Frequency", fontsize=18)
plt.ylabel("Recency", fontsize=18)
plt.show()

Model Evaluation¶

In [38]:
print(' RFM Model Evaluation '.center(70, '='))
X = rfm[['recency_score', 'frequency_score']]
labels = rfm['segment']
print(f'Number of Observations: {X.shape[0]}')
print(f'Number of Segments: {labels.nunique()}')
print(f'Silhouette Score: {round(silhouette_score(X, labels), 3)}')
print(f'Calinski Harabasz Score: {round(calinski_harabasz_score(X, labels), 3)}')
print(f'Davies Bouldin Score: {round(davies_bouldin_score(X, labels), 3)} \n{70*"="}')
======================== RFM Model Evaluation ========================
Number of Observations: 4339
Number of Segments: 10
Silhouette Score: 0.471
Calinski Harabasz Score: 4591.503
Davies Bouldin Score: 0.633 
======================================================================

Segment Analysis¶

In [39]:
rfm[['recency','monetary','frequency','segment']]\
.groupby('segment')\
.agg({'mean','std','max','min'})
Out[39]:
recency monetary frequency
min std max mean min std max mean min std max mean
segment
about_to_sleep -59 10.943474 -21 -39.687500 6.20 533.161343 6207.670 469.058097 1 0.368913 2 1.161932
at_Risk -20 68.618828 281 60.785835 52.00 943.170824 11072.670 938.458341 2 0.954415 6 2.878583
cant_loose -20 65.250378 280 39.968254 70.02 2008.352354 10254.180 2646.822540 6 4.289630 34 8.380952
champions -92 3.683300 -80 -86.638231 198.23 18366.105822 266163.525 6498.612978 3 16.476484 210 12.417062
hibernating -20 92.013560 281 124.605042 3.75 553.714975 9182.345 398.573036 1 0.302492 2 1.101774
loyal_customers -78 15.577050 -21 -59.391941 36.56 5405.787839 120210.355 2752.519574 3 4.545669 63 6.479853
need_attention -59 11.552074 -21 -40.572193 6.90 732.298552 4353.830 847.657086 2 0.470081 3 2.326203
new_customers -92 3.902112 -80 -85.571429 89.94 174.620952 848.550 314.883690 1 0.000000 1 1.000000
potential_loyalists -92 9.338808 -60 -75.601240 20.80 601.746293 4628.750 674.628357 1 0.651401 3 2.010331
promising -78 5.238115 -60 -69.578947 30.00 223.634736 1491.720 285.623723 1 0.000000 1 1.000000
In [41]:
import seaborn as sns
import matplotlib.pyplot as plt

# Define a palette
palette = sns.color_palette("pastel")

# Your existing code
plt.figure(figsize=(18, 8))
ax = sns.countplot(data=rfm,
                   x='segment',
                   palette=palette)
total = len(rfm.segment)
for patch in ax.patches:
    percentage = '{:.1f}%'.format(100 * patch.get_height()/total)
    x = patch.get_x() + patch.get_width() / 2 - 0.17
    y = patch.get_y() + patch.get_height() * 1.005
    ax.annotate(percentage, (x, y), size=14)
plt.title('Number of Customers by Segments', size=16)
plt.xlabel('Segment', size=14)
plt.ylabel('Count', size=14)
plt.xticks(size=10)
plt.yticks(size=10)
plt.show()
In [42]:
plt.figure(figsize=(18, 8))
sns.scatterplot(
    data=rfm, x="recency", y="frequency", hue="segment", palette=palette, s=60
)
plt.title("Recency & Frequency by Segments", size=16)
plt.xlabel("Recency", size=12)
plt.ylabel("Frequency", size=12)
plt.xticks(size=10)
plt.yticks(size=10)
plt.legend(loc="best", fontsize=12, title="Segments", title_fontsize=14)
plt.show()
In [43]:
fig, axes = plt.subplots(1, 3, figsize=(18, 8))
fig.suptitle("RFM Segment Analysis", size=14)
feature_list = ["recency", "monetary", "frequency"]
for idx, col in enumerate(feature_list):
    sns.boxplot(
        ax=axes[idx], data=rfm, x="segment", y=feature_list[idx], palette=palette
    )
    axes[idx].set_xticklabels(axes[idx].get_xticklabels(), rotation=60)
    if idx == 1:
        axes[idx].set_ylim([0, 400])
    if idx == 2:
        axes[idx].set_ylim([0, 30])
plt.tight_layout()
plt.show()
In [44]:
fig, axes = plt.subplots(3, 1, figsize=(16, 12))
fig.suptitle('RFM Segment Analysis', size = 14)
feature_list = ['recency', 'monetary', 'frequency']
for idx, col in enumerate(feature_list):
    sns.histplot(ax = axes[idx], data = rfm,
                 hue = 'segment', x = feature_list[idx],
                 palette= palette)
    if idx == 1:
        axes[idx].set_xlim([0, 400])
    if idx == 2:
        axes[idx].set_xlim([0, 30])
plt.tight_layout()
plt.show()
C:\Users\jki\AppData\Local\Temp\ipykernel_16348\1358203543.py:12: UserWarning:

Creating legend with loc="best" can be slow with large amounts of data.

C:\Users\jki\anaconda3\Lib\site-packages\IPython\core\pylabtools.py:152: UserWarning:

Creating legend with loc="best" can be slow with large amounts of data.

Cohort Analysis¶

A cohort is a group of people sharing something in common, such as the sign-up date to an app, the month of the first purchase, geographical location, acquisition channel (organic users, coming from performance marketing, etc.) and so on. In Cohort Analysis, we track these groups of users over time, to identify some common patterns or behaviors

In [46]:
def CohortAnalysis(dataframe):

    data = dataframe.copy()
    data = data[["CustomerID", "InvoiceNo", "InvoiceDate"]].drop_duplicates()
    data["order_month"] = data["InvoiceDate"].dt.to_period("M")
    data["cohort"] = (
        data.groupby("CustomerID")["InvoiceDate"].transform("min").dt.to_period("M")
    )
    cohort_data = (
        data.groupby(["cohort", "order_month"])
        .agg(n_customers=("CustomerID", "nunique"))
        .reset_index(drop=False)
    )
    cohort_data["period_number"] = (cohort_data.order_month - cohort_data.cohort).apply(
        attrgetter("n")
    )
    cohort_pivot = cohort_data.pivot_table(
        index="cohort", columns="period_number", values="n_customers"
    )
    cohort_size = cohort_pivot.iloc[:, 0]
    retention_matrix = cohort_pivot.divide(cohort_size, axis=0)
    with sns.axes_style("white"):
        fig, ax = plt.subplots(
            1, 2, figsize=(12, 8), sharey=True, gridspec_kw={"width_ratios": [1, 11]}
        )
        sns.heatmap(
            retention_matrix,
            mask=retention_matrix.isnull(),
            annot=True,
            cbar=False,
            fmt=".0%",
            cmap="coolwarm",
            ax=ax[1],
        )
        ax[1].set_title("Monthly Cohorts: User Retention", fontsize=14)
        ax[1].set(xlabel="# of periods", ylabel="")
        white_cmap = mcolors.ListedColormap(["white"])
        sns.heatmap(
            pd.DataFrame(cohort_size).rename(columns={0: "cohort_size"}),
            annot=True,
            cbar=False,
            fmt="g",
            cmap=white_cmap,
            ax=ax[0],
        )
        fig.tight_layout()
    
CohortAnalysis(crm_data_df)

Customer Lifetime Value¶

Customer lifetime value is how much money a customer will bring your brand throughout their entire time as a paying customer.

In [51]:
cltv_df = crm_data_df.groupby("CustomerID").agg(
    {
        "InvoiceDate": [
            lambda x: (x.max() - x.min()).days,
            lambda x: (today_date - x.min()).days,
        ],
        "InvoiceNo": "nunique",
        "TotalPrice": "sum",
    }
)

# Droplevel and rename columns
cltv_df.columns = cltv_df.columns.droplevel(0)
cltv_df.columns = ["recency", "T", "frequency", "monetary"]

cltv_df.head()
Out[51]:
recency T frequency monetary
CustomerID
12346.0 0 233 1 310.44
12347.0 365 275 7 4310.00
12348.0 282 266 4 1770.78
12349.0 0 -74 1 1491.72
12350.0 0 218 1 331.46
In [58]:
#Average Order Value
cltv_df["monetary"] = cltv_df["monetary"] / cltv_df["frequency"]

#Recency & Tenure
cltv_df["recency"] = cltv_df["recency"] / 7
cltv_df["T"] = cltv_df["T"] / 7

#Frequency
cltv_df = cltv_df[(cltv_df['frequency'] > 1)]
In [ ]: